An equalized global graph model-based approach

Abstract —Non-overlapping multi-camera visual object tracking
typically consists of two steps: single camera object tracking
and inter-camera object tracking. Most tracking methods focus
on single camera object tracking, which happens within one
scene, whereas real surveillance scenes also require inter-camera
object tracking, for which single camera tracking methods cannot
work effectively. In this paper, we try to improve the overall
multi-camera object tracking performance by a global graph
model with an improved similarity metric. Our method treats the
similarities of single camera tracking and inter-camera tracking
differently and optimizes them jointly in a global graph model.
The results show that our method works better even when
single camera object tracking is poor.

Index Terms —Multi-camera multi-object tracking, global 
graph model, non-overlapping visual object tracking 

I. Introduction

Tracking objects of interest is an important and challenging
problem in intelligent visual surveillance systems [?].
Since visual surveillance systems produce a huge
amount of video streams, it is desirable that objects of interest
be tracked automatically by algorithms instead of by humans.
Visual object tracking [?] is a long-standing problem in
computer vision, and a great deal of effort has been devoted
to visual object tracking within single cameras [?], [?], [?]. In
intelligent visual surveillance systems [?], [?], due to the finite
camera field of view, it is difficult to observe the complete
trajectory of objects of interest in wide areas with only one
camera. Hence, it is desirable to enable the intelligent visual
surveillance system to track the objects of interest across
multiple cameras [?]. In addition, for practical reasons,
the cameras of such a system are usually
installed with no overlapping areas. Thus, the intelligent
visual surveillance system should be able to track objects
of interest across multiple non-overlapping cameras. In this
paper, we focus on addressing this problem.

As shown in Fig. 1 (Solution A), previous visual object
tracking approaches tackle the problem in two separate
steps: single camera object tracking (SCT) [9], [?], [?]
and inter-camera object tracking (ICT) [?], [?], [?]. SCT
approaches [9], [?], [?] attempt to compute the trajectories
of multiple objects from a single camera view, while ICT
approaches [?], [?], [?] aim to find the correspondences
among those trajectories across multiple camera views. These
ICT approaches often use the trajectories obtained from SCT
for their data association, hence the overall tracking
system is brittle and its performance depends on
the results of the single camera object tracking module. For
videos of challenging scenes, existing SCT approaches [?], [?],
[?] are also fragile, since their results often contain fragments
and false positives. These false
positives and fragments directly cause problems in the ICT module, such
as the wrong matching problem, i.e. two targets in Camera 2 are
matched to different tracklets of the same target in Camera 1 (see
Fig. 2 (a)), and the tracklet missing problem, i.e. some tracklets
of a target are missing during inter-camera tracking (see Fig.
2 (b)). These problems are inevitable as long as multi-camera
object tracking is solved in two steps. We address these
problems by integrating the two separate modules and jointly
optimising them.

We develop a global multi-camera object tracking approach.
It integrates the two steps via an equalized global graph
model to avoid these “inevitable” problems and aims to
improve the overall performance of multi-camera object tracking.

Considering the two different steps, we evaluate the overall
performance by the following two criteria:

• Single camera object tracking: measuring how well the
completed pedestrian trajectories in a single camera can
be used to rebuild their exact historical paths in each
scene.

• Inter-camera object tracking: evaluating how well the
inter-camera matching helps to locate the pedestrians in
a wide area.

As shown in Fig. 1 (Solution A), SCT and ICT share a
similar data association framework: a graph model with
an optimisation solution. In the single camera object tracking
module, the data association inputs are the initial observations,
such as detections or tracklets, and the outputs are
the integrated trajectories in each single camera (known as
mid-term trajectories). These mid-term trajectories are then
used as inputs to the data association in inter-camera
object tracking, and the outputs of the ICT approaches are
the final integrated trajectories across multiple cameras (known as
final trajectories). To integrate these two data associations,
the straightforward idea is to establish a new data association
which takes the initial observations as inputs and outputs the
final trajectories directly. However, a new problem arises,
i.e. how to measure the similarity between two observations
in the new graph. Some similarities are between observations
which belong to the same camera, and others are between observations
which belong to different cameras. Under a single similarity
metric, the average similarity score between observations in
different cameras would commonly be lower¹ than that between
observations in the same camera, because the appearance
information and the spatio-temporal information of objects
are less reliable in ICT than in SCT due to many
factors (camera settings, viewpoints and lighting conditions).
In this case, the optimisation process makes the graph give
priority to linking observations along edges within the
same camera instead of those across cameras, which would
lead to a poor optimisation result for the whole multi-camera
object tracking. To solve this problem, we have to handle
two questions: how to distinguish the similarities within the same
camera from those across different cameras, and how to balance
them in the new graph. In this paper, we improve the similarity
metric, differentiate between the similarities of SCT and ICT,
and equalize them in a global graph. A minimum uncertainty
gap [?] is adopted to establish the improved similarity metric.
Thanks to this, the similarity scores in both SCT and ICT are
equalized in the proposed global graph model.

Fig. 1. Illustration of three types of multi-camera visual object tracking
solution.

The contributions of this paper² are as follows:

1) a global graph model for multi-camera object tracking is
presented, which integrates the SCT and ICT steps
to avoid the “inevitable” problems;

2) an improved similarity metric is proposed to equalize
the different similarities of the two steps and unify them in
one graph;

3) the proposed approach is evaluated with a comprehensive
evaluation criterion, which clearly shows that our
method is more effective than the traditional two-step
multi-camera visual tracking framework.

II. Related Work 

Using a graph model is an efficient and effective way to
solve the data association problem in multi-camera visual
object tracking. First, graph modeling is used to form
a solvable graph model from the input observations (detections,
tracklets, trajectories or pairs). It includes nodes, edges and
weights. Then an optimisation solution is brought in to solve
the graph and obtain optimal or suboptimal solutions. The
difference is that single camera object tracking (SCT) places
particular emphasis on the graph and the optimisation solution,
i.e. how to build a more efficient or more discriminative
graph, while inter-camera object tracking (ICT) focuses on
nodes, edges and weights, i.e. on obtaining a more
effective feature representation. ICT uses more complex
and more sophisticated representations or similarity metrics
(i.e. a transition matrix), but with a simpler graph model. The
proposed approach takes advantage of both SCT and ICT: the
proposed similarity metric is extended from a classical
inter-camera tracking method [?], and the global graph model
builds on a state-of-the-art SCT approach [21].

¹A higher similarity score indicates a higher likelihood of a link between
two observations.

²A preliminary version of this paper appeared in Chen et al. [?] and the
source code is available at https://github.com/cwhgn/EGTracker.

This section introduces related approaches for each of
SCT, ICT and MCT. Section II-A reviews single camera
multi-object tracking. Section II-B discusses inter-camera
object tracking with a brief introduction to object
re-identification. Section II-C presents some other multi-camera
object tracking approaches that take both SCT and ICT into
account.

A. Single Camera Object Tracking (SCT) 

In single camera multi-object tracking, the prediction of
the spatio-temporal information of objects is more reliable
and the appearance of objects does not vary much
during tracking. This makes the SCT task less challenging
than the ICT task, i.e. for some less challenging videos, a
simple appearance representation (e.g. color histogram [?],
[?], [?]) works well. The graph model is often used to
solve different problems, such as occlusion [25], [26], crowds
[?], [27] and interference from appearance similarity [28], [?].
However, for challenging videos, these approaches lead to
frequent id-switch errors and trajectory fragments.

Existing approaches in SCT usually follow a data
association-based tracking framework, which links short tracklets
[?], [23], [?] or detection responses [31], [32], [?]
into trajectories by a global optimization based on various
kinds of features, such as motion (position, velocity) and
appearance (color, shape). Improvements generally come
from two aspects: the graph model and the optimization
solution. Some researchers focus on developing a new graph
model for their tracklets or detections, aiming to solve a
specific problem. Possegger et al. [23] adopt a geodesic method
to handle the occlusion problem. Dicle et al. [?]
use motion dynamics to solve generalized linear assignments
when targets with similar appearances exist. Other works in
SCT focus on improving the optimization solution
framework, such as continuous energy minimization [34],
linear programming [?], CRF [36] and mixed integer
programming [37]. Zhang et al. [21] propose a maximum a
posteriori (MAP) model to solve the data association of
multi-object tracking, while Yang et al. [?] utilize an online
CRF approach to handle the optimization with the benefit of
distinguishing spatially close targets with similar appearances.
These approaches can partly reduce id-switches and trajectory
fragments, but the separate optimisation still leaves
many fragments and false positives to the ICT step.
Fig. 2. Illustration of the two matching problems: (a) wrong matching and
(b) tracklet missing. Blue and red lines indicate two targets and arrows
show the best matching. Target B is wrongly matched to tracklet A2 in (a).
Tracklet A1 is missing in (b).


B. Inter-camera Object Tracking (ICT)

Inter-camera tracking is more challenging than SCT because
of the more dramatic changes in appearance caused by many
factors (camera settings, viewpoints and lighting conditions)
and the less reliable spatio-temporal information across different
camera views. As a result, learning a discriminative and
invariant feature representation and a suitable similarity metric
are the main problems in ICT.

Most ICT works address these problems through multi-camera
calibration [?], [?], [40] or feature cues [?], [?],
[?], [?], [?]. Since the calibration is static
information, the calibration-based approaches
project the multiple scenes into a 3D coordinate system and
achieve the matching by using the projected position information.
Hu et al. [?] adopt a principal axis-based correspondence
to achieve the calibration. For feature cues, most approaches
utilize improved appearance or spatio-temporal information
to achieve the matching. Kuo et al. [?] apply a multi-instance
learning approach to learn an appearance affinity
model, while Matei et al. [43] integrate appearance and
spatio-temporal likelihoods within a multi-hypothesis framework.

From the perspective of graph modeling, a K-camera
ICT data association can be treated as a K-partite graph
matching problem. It is difficult to get the optimal solution,
but there are many approaches that yield suboptimal solutions,
e.g. the weighted bipartite graph [46], the Hungarian
algorithm [47] and the binary integer program [48]. The K-partite
idea assumes that each camera already has a perfect
tracking result which should not be changed any more. In
practice, the SCT result is not ideal and the assumption is
broken. In this case, the SCT result should be modifiable and
the data association is more like a global optimization problem
than a K-partite graph matching problem.

To conclude the discussion of ICT, it is worth mentioning that
object re-identification (Re-ID) is an important part of ICT.
When the topology of the camera network is not available or
the scenes do not overlap, the spatio-temporal information
is invalid. In this case, the appearance cue is the only information
that can be used for matching. Studying object re-identification
separately helps to better understand the capability of object
matching by using visual features alone. Most object
re-identification improvements focus on certain
appearance properties of objects, such as color [20], [49], shape [?], [?]
and texture [?]. Recently, Li et al. [?] successfully applied a
CNN to Re-ID to extract an effective feature representation.
However, the highest identification rate is still below 0.3 on
benchmarks, and these approaches are not yet practical.

As we said, the ICT approaches share a common assumption
that the single camera object tracking results are perfect,
i.e. that the trajectories in single cameras are all true positives and
complete. So far, this is difficult to
achieve.


C. Multi-camera Object Tracking (MCT) 

A good MCT is the ultimate goal for any researcher in
tracking. Most MCT methods follow the two-step framework,
i.e. an SCT algorithm plus an ICT algorithm. In the
Multi-Camera Object Tracking Challenge [54] of the ECCV 2014 visual
surveillance and re-identification workshop, the methods of most
participating teams are two-step approaches. The winning
USC-Vision team uses a state-of-the-art SCT method [32] and a
state-of-the-art ICT method [?].

Besides two-step approaches, there are some multi-camera
object tracking approaches [55], [56], [57], [?] that concentrate
on integrating the processes of SCT and ICT into one global
graph, as this paper does. They mainly follow a
tracking-by-detection paradigm and form a global association graph (see
Fig. 1 (Solution C)). Yu et al. [56] propose a nonnegative
discretization solution for data association and identify people
across different cameras by face recognition. However, in real
scenes with objects in a distant view, faces are too small to be
recognized. Hofmann et al. [?] use a global min-cost flow
graph and connect the detections from different views through their
overlapping locations in a world coordinate space, which is
not suitable for the non-overlapping camera problem.

In this paper, the proposed method uses tracklet observations
as the inputs instead of object detections, because tracklets are more
reliable for matching. We consider multi-camera object
tracking as a global tracklet association under a panoramic
view (see Fig. 1 (Solution B)). The similarities of different
tracklets in the global tracklet association are treated
differently according to the cameras they belong to. This
framework provides a new solution for multi-camera object
tracking when the SCT performance is not good enough for the
subsequent ICT process. Its local performance in a specific camera
view may be as fragmentary as that of traditional SCT
methods, even though the inter-camera information may provide some
useful feedback for each specific camera. However, it overcomes
the new problems that emerge in ICT when SCT is not good
and offers a better ICT performance. In practice, a better
ICT has stronger practical significance than a better SCT: for a video
surveillance system, it is more important to locate the objects
in the whole wide area than in a single scene.

III. Global Graph Model

Our goal is to predict the trajectories from a given
series of observed videos. The proposed approach
optimises single camera tracking and inter-camera tracking
in one global data association process. The data association
is modeled as a global maximum a posteriori (MAP) problem,
inspired by the MAP formulation of
Zhang et al. [21]. The difference is that the input of the
proposed solution is tracklets rather than object detections,
and that the association aims to solve the wrong matching and
the tracklet missing problems in ICT, while Zhang et al. [21]
apply it to SCT. We outline the variable definitions in Table I.

Fig. 3. Illustration of the min-cost flow network. An example of the
min-cost flow network with 3 timesteps and 6 tracklets. The numbers of N, E
and W are 14, 21 and 21.


In our approach, a single trajectory hypothesis is defined as
an ordered list of target tracklets, i.e. $T_k = \{l_{k_1}, l_{k_2}, \ldots\}$,
where $l_{k_i} \in L$. The association trajectory hypothesis $\mathcal{T}$ is
defined as a set of single trajectory hypotheses, i.e. $\mathcal{T} = \{T_k\}$.
The objective of the data association is to maximize the
posterior probability of $\mathcal{T}$ given the tracklet set $L$ under the
non-overlapping constraints [21]:

$$\mathcal{T}^* = \arg\max_{\mathcal{T}} \prod_i P(l_i \mid \mathcal{T}) \prod_{T_k \in \mathcal{T}} P(T_k), \qquad \text{s.t. } T_i \cap T_j = \emptyset,\ \forall i \neq j. \tag{1}$$

$P(l_i \mid \mathcal{T})$ is the likelihood of tracklet $l_i$. The prior $P(T_k)$ is
modeled as a Markov chain containing the transition probabilities
$\prod_i P(l_{k_{i+1}} \mid l_{k_i})$ of all tracklets in $T_k$ [58].

The transition probability $P(l_j \mid l_i)$ is computed from the
probabilities of the appearance feature $P_a(l_i \to l_j)$ and the
motion feature $P_m(l_i \to l_j)$:

$$P(l_j \mid l_i) = P(l_i \to l_j) = \left(P_a(l_i \to l_j)\right)^{k_1} \cdot \left(P_m(l_i \to l_j)\right)^{k_2}, \tag{2}$$

where $k_1$ and $k_2$ are the weights of the two features.

The MAP association model can be solved by a min-cost
flow network [?]. The min-cost flow graph is formulated as
$G = \{N, E, W\}$, where $N$, $E$ and $W$ stand for nodes, edges
and weights respectively, and a weight is the cost of
using the corresponding edge. In the graph $G$, two nodes $l_i^{enter}$
and $l_i^{exit}$ are defined for each tracklet $l_i$. The observation edge $e_i$
from node $l_i^{enter}$ to $l_i^{exit}$ indicates the likelihood of tracklet $l_i$.

TABLE I

Notations of Equalized Global Graph Model

Symbol             Description
$l_i$              A single input tracklet consisting of several attributes,
                   $l_i = [x_i, c_i, s_i, t_i, a_i]$.
$L$                The set of all input tracklets, $L = \{l_1, l_2, \ldots, l_M\}$.
$T_i$              A single trajectory hypothesis consisting of an ordered list of
                   target tracklets, $T_i = \{l_{i_1}, l_{i_2}, \ldots\}$.
$\mathcal{T}^*$    The output of the algorithm, which is the optimal set of trajectory
                   hypotheses.
$G$                The min-cost flow graph, $G = \{N, E, W\}$.
$N$                The set of nodes in the graph, $N = \{S, T, l_i^{enter}, l_i^{exit}\}$,
                   $i \in [1, M]$.
$E$                The set of edges in the graph, $E = \{e_i\} \cup \{e_{Si}, e_{iT}\} \cup \{e_{ij}\}$,
                   $i \in [1, M]$.
$W$                The set of weights in the graph, $W = \{w_i\} \cup \{w_{Si}, w_{iT}\} \cup \{w_{ij}\}$,
                   $i \in [1, M]$.
$h_i^n$            The MCSHR of tracklet $l_i$ in the $n$th frame.
$H_i$              The incremental MCSHR for the whole tracklet $l_i$.
$\Lambda_{k,j}$    The similarity between any MCSHR pair $h^k$ and $h^j$.
$\tau_i$           The best periodic time for tracklet $l_i$.


The corresponding observation weight $w_i$ is set to the negative
logarithm of the likelihood ratio of $P(l_i \mid \mathcal{T})$:

$$w_i = -\log \frac{P(l_i \mid \mathcal{T})}{1 - P(l_i \mid \mathcal{T})}. \tag{3}$$

The possible linking relationship between any two tracklets
is expressed as a transition edge $e_{ij}$ from node $l_i^{exit}$ to node
$l_j^{enter}$. The transition weight $w_{ij}$ is the negative logarithm of
the transition probability $P(l_j \mid l_i)$, which can be decomposed
into the probabilities of continuity in appearance and motion:

$$w_{ij} = -\log P(l_j \mid l_i) = -k_1 \log P_a(l_i \to l_j) - k_2 \log P_m(l_i \to l_j). \tag{4}$$
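
As a minimal sketch of Eqs. (3) and (4), the two weights can be computed as below. The function names, the default values of `k1` and `k2`, and the clamping constant `eps` are illustrative assumptions, not values taken from the paper.

```python
import math


def observation_weight(likelihood: float, eps: float = 1e-9) -> float:
    """Eq. (3): negative log of the likelihood ratio P / (1 - P)."""
    # Clamp to (0, 1) so the logarithm stays finite (an assumption of this sketch).
    p = min(max(likelihood, eps), 1.0 - eps)
    return -math.log(p / (1.0 - p))


def transition_weight(p_app: float, p_mot: float,
                      k1: float = 1.0, k2: float = 1.0,
                      eps: float = 1e-9) -> float:
    """Eq. (4): weighted negative logs of appearance and motion similarity."""
    return -k1 * math.log(max(p_app, eps)) - k2 * math.log(max(p_mot, eps))
```

A reliable tracklet (likelihood above 0.5) gets a negative observation cost, so a min-cost solution is rewarded for including it, while every transition adds a non-negative cost that grows as the similarities drop.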

In addition to these nodes and edges, there are two extra
nodes $S$ and $T$. They are the virtual source and sink of the min-cost
flow graph. The enter/exit edges $e_{Si}$ and $e_{jT}$ are also added
to represent the start tracklet $l_i$ and the end tracklet $l_j$. The
enter/exit weights of these tracklets are both set to 0 in this
paper, because every tracklet could equally be a start or an end
at no cost.

In summary, the number of nodes $N$ is $(2M + 2)$, and
the numbers of edges $E$ and weights $W$ are smaller than those
of the fully connected graph, $(3M + 2\binom{M}{2})$, where $M$ is the
total number of tracklets in all cameras. As shown in Fig. 3,
the graph is solved by min-cost flow, and the optimal
solution maximises the posterior probability of $\mathcal{T}$ at
the minimum cost.

In the rest of this section, we introduce every part of the
min-cost flow graph, especially the weights $W$.


A. Nodes 

In the proposed approach, the tracklets extracted by a
single-object tracking method are treated as input observations
instead of detections. In other words, these tracklets are used
to produce the nodes of the global graph model. One
reason is that they carry more information (such as motion) than
detections, which only contain appearance information. With
more information, they can be considered more credible
nodes, and the similarities between them are more reliable. What is
more, the number of tracklets is much smaller than that
of detections, which speeds up
the graph optimization; this is also very important
for practical usage. In this paper, the deformable part-based
model (DPM) detector [59] and an AIF tracker [60] are first
used to get all the tracklets from each camera. After obtaining
detections with the DPM detector, we use the AIF tracker to
track every target and get its tracklet. During
tracking, a confidence $\alpha_t$ [60] is calculated
to evaluate the accuracy of the tracking result in frame $t$. If the
confidence score is lower than the threshold $\theta$, i.e. $\alpha_t < \theta$,
the tracker is considered to be lost. Then all confidence values
of the target in the previous frames are recorded and their average
value $c_i$ is taken as the likelihood $P(l_i \mid \mathcal{T})$ of tracklet $l_i$:

$$c_i = P(l_i \mid \mathcal{T}) = \frac{\sum_{t = t_i^{start}}^{t_i^{end}} \alpha_t}{t_i^{end} - t_i^{start}}, \tag{5}$$

where $t_i^{start}$ and $t_i^{end}$ are the start and end frames of tracklet
$l_i$.
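
The tracklet termination rule and the averaged confidence can be sketched as follows. This is only an illustration: the function name is hypothetical, the confidences are assumed to arrive as a per-frame sequence, and the plain arithmetic mean over the recorded frames is an assumption about the normalisation of Eq. (5).

```python
def tracklet_likelihood(confidences, theta):
    """Return (likelihood, n_frames) for one tracked target.

    confidences: per-frame tracker confidences in temporal order.
    theta: threshold below which the tracker is considered lost.
    """
    kept = []
    for alpha in confidences:
        if alpha < theta:      # tracker lost: the tracklet ends here
            break
        kept.append(alpha)
    if not kept:               # lost in the very first frame
        return 0.0, 0
    # Assumed normalisation: arithmetic mean over the recorded frames.
    return sum(kept) / len(kept), len(kept)
```

For example, with confidences `[0.9, 0.8, 0.7, 0.2, 0.9]` and `theta = 0.5`, the tracklet ends after three frames and its likelihood is the mean of the first three values.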

Thus all the tracklets from all cameras are obtained, $L =
\{l_1, l_2, \ldots, l_M\}$, where each tracklet $l_i = [x_i, c_i, s_i, t_i, a_i]$
consists of position, likelihood, camera view, time stamp and
appearance information respectively. The nodes $N$ can be
expressed as:

$$N = \{S, T, l_i^{enter}, l_i^{exit}\}, \quad i \in [1, M]. \tag{6}$$

Fig. 4. Illustration of computing the periodic time for a tracklet. An
example for a tracklet with a length $\omega_i$ of 9 frames. The Avg Sim column
shows the validity of every possible periodic time $t$. The maximum in the Avg
Sim column indicates the best periodic time $\tau_i$ for this tracklet.


B. Edges

Edges are also an important part of the graph model. All the
observation edges and enter/exit edges are kept in the min-cost
flow graph. However, only part of the transition edges
are retained, because not all of them are meaningful.
Three rules are used for selecting transition edges in our graph.

Firstly, for edge $e_{ij}$, the start frame of tracklet $l_j$
must be after the end frame of tracklet $l_i$, without any
overlapping frame. This rule ensures the uniqueness of objects
in every frame and keeps the edges directed. Secondly, the
two tracklets $l_i$ and $l_j$ should come from the same camera or from
two cameras with an existing topological connection, which
ensures that the link between the two tracklets is possible from a
panoramic view. Thirdly, a waiting time threshold $\eta$ is introduced to
limit the link between two tracklets. If the time interval between
two tracklets is longer than the threshold $\eta$, the
likelihood of this link is close to zero. As a result, only the edges
that meet all requirements are selected and kept:

$$E = \{e_i\} \cup \{e_{Si}, e_{iT}\} \cup \{e_{ij}\}, \quad i \in [1, M], \qquad \text{s.t. } 0 < t_{ij}^{inv} < \eta,\ Topo(s_i, s_j) = 1, \tag{7}$$

where $t_{ij}^{inv}$ is the time interval between the end of $l_i$ and the
start of $l_j$, and $Topo(s_i, s_j) = 1$ means that the camera views $s_i$ and $s_j$
have an existing topological connection.

For all these selected edges $E$, the flow on each edge is restricted to 0 or
1, because every target should be at one and only one place
at any time. If the flow on an edge is 1 in the optimal solution,
this link exists and the two tracklets of this link
belong to the same target.


C. Weights

Weights are an essential attribute of links and are used to
represent the relationships between nodes. In this paper, we use
the similarities among tracklets as weights to indicate the cost
of building links. As mentioned above, the weights $W$
consist of three parts, the same as the edges:

$$W = \{w_i\} \cup \{w_{Si}, w_{iT}\} \cup \{w_{ij}\}, \quad i \in [1, M]. \tag{8}$$

The observation weights are obtained from the tracklet likelihoods,
and the enter/exit weights are all set to 0 as mentioned above.
For the transition weights, the appearance similarity $P_a(l_i \to l_j)$
and the motion similarity $P_m(l_i \to l_j)$ are used to form the
weights. In the following we introduce them respectively.

$$\begin{cases}
w_i = -\log \dfrac{P(l_i \mid \mathcal{T})}{1 - P(l_i \mid \mathcal{T})} = -\log \dfrac{c_i}{1 - c_i}, \\
w_{Si} = w_{iT} = 0, \qquad i, j \in [1, M], \\
w_{ij} = -\log P(l_j \mid l_i) = -k_1 \log P_a(l_i \to l_j) - k_2 \log P_m(l_i \to l_j).
\end{cases} \tag{9}$$
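
The three transition-edge rules of Eq. (7) can be sketched as a simple filter over tracklet pairs. The `Tracklet` record, the representation of the camera topology as a set of ordered camera pairs, and the function name are assumptions made for this illustration only.

```python
from dataclasses import dataclass


@dataclass
class Tracklet:
    cam: int      # camera view s_i
    t_start: int  # first frame of the tracklet
    t_end: int    # last frame of the tracklet


def select_transition_edges(tracklets, topo, eta):
    """Return index pairs (i, j) that satisfy the three rules of Eq. (7).

    topo: set of ordered camera pairs (a, b) with a topological connection.
    eta:  waiting time threshold, in frames.
    """
    edges = []
    for i, li in enumerate(tracklets):
        for j, lj in enumerate(tracklets):
            if i == j:
                continue
            t_inv = lj.t_start - li.t_end          # rule 1: must be positive
            connected = li.cam == lj.cam or (li.cam, lj.cam) in topo  # rule 2
            if 0 < t_inv < eta and connected:      # rule 3: below the threshold
                edges.append((i, j))
    return edges
```

Because a positive interval is required, every kept edge points forward in time, so the resulting graph is acyclic.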

1) Appearance Similarity: As shown in Section II, both
SCT and ICT have their own representations and similarity
metrics, and those of ICT methods are more sophisticated
than those of SCT methods. In order to build an equalized metric,
the proposed approach adopts an ICT representation, but one that
does not use any learning process, since learning would strongly increase
the computing time. This representation is called the Piecewise
Major Color Spectrum Histogram Representation (PMCSHR)
[?]. It is an improved version of the Major Color Spectrum
Histogram Representation (MCSHR) [?] with some periodicity
information that is specific to pedestrians. MCSHR obtains the
major colors of a target based on an online k-means clustering
algorithm. The original way of computing the MCSHR of a
tracklet is to integrate the histograms of all frames together:

$$H_i = \sum_{n=1}^{\omega_i} h_i^n, \tag{10}$$

where $h_i^n$ is the MCSHR of tracklet $l_i$ in the $n$th frame,
$H_i$ is the incremental MCSHR [20] for the whole tracklet $l_i$, and
$\omega_i$ is the length of tracklet $l_i$.
As non-rigid targets, pedestrians are challenging objects to
track even with the help of the MCSHR. However, we
can make some assumptions to help tracking. We assume that
pedestrians always walk at a constant speed in the scenes,
and the goal of our approach is to find the periodic time with which to
segment the tracklets.

All MCSHRs $\{h_i^1, \ldots, h_i^{\omega_i}\}$ of the tracklet $l_i$ are first
obtained, and then the similarity $\Lambda_{k,j}$ between any pair $h_i^k$ and
$h_i^j$ is computed. The intuition is to evaluate all the possible
periodic times and find the best one. For a certain periodic
time $t$, the similarity $\Lambda_{j,j+t}$ between $h_i^j$ and its next periodic
$h_i^{j+t}$ is collected for every frame $j$, and the average similarity
is taken as the value which determines the validity of this
periodic time $t$. As shown in Fig. 4, the periodic time with the
highest validity is considered the best periodic time for
tracklet $l_i$:

$$\tau_i = \arg\max_t \frac{1}{\omega_i - t} \sum_{j=1}^{\omega_i - t} \Lambda_{j,j+t}, \quad \forall t \in [\gamma, \omega_i/2). \tag{11}$$

The set $[\gamma, \omega_i/2)$ is used to limit the possible range of $t$, and
$\gamma$ is set to 15. If $\gamma$ is too small, the nearby frames will have a
strong similarity, which drives Eq. (11) to a false maximum.
After this calculation, $\tau_i$ is the best periodic time for tracklet
$l_i$. Then the tracklet $l_i$ can be evenly segmented into pieces
of length $\tau_i$ (except the end part). For each piece, the
incremental MCSHR is computed. The PMCSHR of tracklet
$l_i$ is represented by $\{H_i^1, \ldots, H_i^{d_i}\}$, and $d_i = \lceil \omega_i / \tau_i \rceil$ is the
number of pieces that the tracklet $l_i$ is segmented into.
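
The search for the best periodic time in Eq. (11) and the subsequent segmentation can be sketched as below. The sketch assumes that the pairwise frame similarities are already available as a square matrix `sim` with 0-based indices; the function names are hypothetical, and ties are broken in favour of the smallest period, which is a choice of this sketch.

```python
import math


def best_period(sim, gamma=15):
    """Eq. (11): return the period t in [gamma, n/2) with the highest
    average similarity between frames j and j + t, or None if the
    tracklet is too short to offer any candidate."""
    n = len(sim)
    best_t, best_score = None, -math.inf
    for t in range(gamma, math.ceil(n / 2)):
        score = sum(sim[j][j + t] for j in range(n - t)) / (n - t)
        if score > best_score:  # strict comparison keeps the smallest t on ties
            best_t, best_score = t, score
    return best_t


def segment(n_frames, period):
    """Split frame indices into pieces of length `period`;
    the last piece may be shorter."""
    return [range(s, min(s + period, n_frames))
            for s in range(0, n_frames, period)]
```

With a tracklet of 10 frames and a period of 4, `segment` yields three pieces of lengths 4, 4 and 2, matching the rule that only the end part may be shorter.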

Then the similarity between every two pieces from tracklets
$l_i$ and $l_j$ is computed, and the average similarity
$Dis(l_i, l_j)$ is taken as the appearance similarity between
the two tracklets:

$$P_a(l_i \to l_j) = Dis(l_i, l_j) = \frac{1}{d_i d_j} \sum_{n=1, m=1}^{d_i, d_j} \Lambda(H_i^n, H_j^m), \tag{12}$$

where $\Lambda(H_i^n, H_j^m)$ is the similarity metric for two tracklets’
incremental MCSHRs.
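
A minimal sketch of the averaging in Eq. (12) follows. The piece-level similarity is passed in as a function; the histogram intersection used here is only a stand-in chosen for illustration and is not the MCSHR similarity metric of the paper.

```python
def appearance_similarity(pieces_i, pieces_j, piece_sim):
    """Eq. (12): mean of piece_sim over all pairs of pieces
    taken from two tracklets."""
    total = sum(piece_sim(a, b) for a in pieces_i for b in pieces_j)
    return total / (len(pieces_i) * len(pieces_j))


def histogram_intersection(h1, h2):
    """Stand-in similarity between two normalised histograms
    (an assumption of this sketch, not the MCSHR metric)."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Averaging over all piece pairs means that a single badly matching piece, for example one captured during an occlusion, lowers the score but does not dominate it.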

2) Motion Similarity: For a general method that is applicable
to both overlapping and non-overlapping views, it is hard to
always build an exact 3D coordinate system into which all scenes
can be projected. Hence, in this paper a relative distance between two
tracklets is adopted to measure the motion similarity. For two
tracklets $l_i$ and $l_j$, it is easy to get their interval time by a
simple subtraction. If the two tracklets are likely to belong to
one target, the interval time $t_{ij}^{inv}$ must be a positive number:

$$t_{ij}^{inv} = t_j^{start} - t_i^{end}, \tag{13}$$

where $t_j^{start}$ is the start time of tracklet $l_j$ and $t_i^{end}$ is the end
time of tracklet $l_i$.

Fig. 6. Illustration of the enter/exit areas for multi-camera visual
object tracking. The enter/exit areas for links from Cam 1 to Cam 2 are in
column (a), while those from Cam 2 back to Cam 1 are in column (b). The
blue and yellow areas indicate the exit and enter areas respectively, and the
red points represent the disappearing points.

With the interval time $t_{ij}^{inv}$, the position $x_i^{tail}$ and the
velocity $v_i^{tail}$ of tracklet $l_i$ at its end time, we can predict
the position of tracklet $l_i$ after the time $t_{ij}^{inv}$. The new
position can be calculated as below:

$$\hat{x}_i = x_i^{tail} + v_i^{tail} \cdot t_{ij}^{inv}. \tag{14}$$

For tracklet $l_j$, we can do the same and get its
predicted position the time $t_{ij}^{inv}$ ago:

$$\hat{x}_j = x_j^{head} - v_j^{head} \cdot t_{ij}^{inv}. \tag{15}$$

As people usually walk along a smooth path in real scenes,
we can assume that if the two tracklets belong to the same
person, the corresponding predicted positions must be close to
each other. In other words, $\hat{x}_i$ and $\hat{x}_j$ should be close enough to
$x_j^{head}$ and $x_i^{tail}$ respectively. Therefore, the distances between
the predicted positions and the original positions, denoted $\Delta x_i$ and
$\Delta x_j$, are used to represent
the motion similarity between two tracklets (see Fig. 5).

So the motion similarity in a single camera is computed
as below:

$$P_m(l_i \to l_j) = \exp\left(-\lambda (\Delta x_i + \Delta x_j)\right), \quad s_i = s_j. \tag{16}$$
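
The same-camera motion similarity of Eqs. (14) to (16) can be sketched as follows, assuming 2D image positions and velocities given as tuples and the value 0.01 for the decay parameter reported for the experiments. The function names are hypothetical.

```python
import math


def predict(pos, vel, dt):
    """Linear extrapolation of a 2D position over a time span dt."""
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)


def motion_similarity_same_camera(x_tail, v_tail, x_head, v_head,
                                  t_inv, lam=0.01):
    """Eqs. (14)-(16) for two tracklets observed in the same camera."""
    x_i_pred = predict(x_tail, v_tail, t_inv)    # l_i pushed forward in time
    x_j_pred = predict(x_head, v_head, -t_inv)   # l_j pushed backward in time
    dx_i = math.dist(x_i_pred, x_head)           # forward prediction error
    dx_j = math.dist(x_j_pred, x_tail)           # backward prediction error
    return math.exp(-lam * (dx_i + dx_j))
```

When both predictions land exactly on the other tracklet's endpoint, the similarity is 1; it decays exponentially with the summed prediction errors.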

As shown in Eq. [T^ the relative distance is only valid 
for two tracklets from the same camera. If tracklets are from 
different cameras, the interval time is partly invalid. Becasue 
in inter-camera cases, the pathes between cameras are hard to 
measure which renders the interval time useless for predicting 
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Fig. 7. Illustration of the computing method for the minimum relative 
distance across cameras. In column (b), and are in exit and 

enter areas respectively, which indicate that both of Ax'^'^'^ and Ax'^'^^ are 
set to 0. The red lines in column (c) are AxY^'^'^ and Ax^^'^. 

^ 3 


positions. In this case, the relative distance mostly tends to be 
a huge wrong number. To handle this problem, a minimum 
relative distance is applied to compute the similarity across 
cameras, which is comparable with Eq. [T^ 

Enter/exit areas are commonly used in uncalibrated
camera systems to help re-locate the exact positions of targets.
Hence, we labeled the enter/exit areas of each camera view with
the help of topology information.

For a person, if she disappears from an exit area, we
assume that she can be found in the enter area of the possible
corresponding camera (see Fig. 7(a)). If she disappears
from an area near an exit area, she can re-appear in the
possible corresponding enter area with a high probability.
Under this assumption, we manually set a disappearing point
for each area to connect cameras. Then the minimum relative
distance $\Delta x_i^{min}$ to the disappearing point during the whole
interval time is adopted to measure the motion similarity
across cameras instead of the original relative distance $\Delta x_i$,
as shown in Fig. 7(b) and (c).




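$$\Delta x_i^{min} = \begin{cases} \min\limits_{0 \le t \le t^{inv}_{i,j}} \| x_i^{tail} + v_i^{tail} t - x^{exit} \|_2 & \text{if } x_i^{tail} \notin Area_{exit}, \\ 0 & \text{if } x_i^{tail} \in Area_{exit}. \end{cases} \qquad (17)$$

$$\Delta x_j^{min} = \begin{cases} \min\limits_{0 \le t \le t^{inv}_{i,j}} \| x_j^{head} - v_j^{head} t - x^{enter} \|_2 & \text{if } x_j^{head} \notin Area_{enter}, \\ 0 & \text{if } x_j^{head} \in Area_{enter}. \end{cases} \qquad (18)$$

$$P_m(l_i \to l_j) = \exp(-\lambda(\Delta x_i^{min} + \Delta x_j^{min})), \qquad (19)$$

where $x^{enter}$ and $x^{exit}$ are the positions of the disappearing
points for the enter area and the exit area of the respective camera.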

Another benefit of the minimum relative distance is that it
is measured within each camera, so it is comparable with the
relative distance. With its help, the motion similarity metric
can be extended from a single camera to a multi-camera system
and can be considered well equalized in the global graph.

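The final equalized motion similarity metric is:

$$P_m(l_i \to l_j) = \begin{cases} \exp(-\lambda(\Delta x_i + \Delta x_j)) & \text{if } s_i = s_j, \\ \exp(-\lambda(\Delta x_i^{min} + \Delta x_j^{min})) & \text{if } s_i \neq s_j, \end{cases} \qquad (20)$$

where $\lambda$ is set to 0.01 in the experiments.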
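The inter-camera branch of the motion similarity can be sketched in the same spirit. The snippet below is an illustration under several assumptions that are not taken from the paper: enter/exit areas are simplified to axis-aligned rectangles (the paper uses labeled polygons), the minimum over the interval time is found in closed form by projecting the disappearing point onto the extrapolated segment, and the backward extrapolation of Eq. 18 is expressed by passing the negated head velocity. All names are hypothetical.

```python
import math


def in_rect(point, rect):
    """True if point lies inside the rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return xmin <= point[0] <= xmax and ymin <= point[1] <= ymax


def min_distance_along_motion(start, vel, target, t_max):
    """min over 0 <= t <= t_max of ||start + vel * t - target||_2."""
    vv = vel[0] ** 2 + vel[1] ** 2
    if vv == 0.0:
        t_star = 0.0  # a static target never gets closer
    else:
        # Unconstrained minimiser of the squared distance, clamped to [0, t_max].
        t_star = ((target[0] - start[0]) * vel[0]
                  + (target[1] - start[1]) * vel[1]) / vv
        t_star = min(max(t_star, 0.0), t_max)
    closest = (start[0] + vel[0] * t_star, start[1] + vel[1] * t_star)
    return math.dist(closest, target)


def min_relative_distance(pos, vel, area, disappearing_point, t_inv):
    """Eq. 17 / Eq. 18: zero inside the area, closest approach otherwise."""
    if in_rect(pos, area):
        return 0.0
    return min_distance_along_motion(pos, vel, disappearing_point, t_inv)


def inter_camera_motion_similarity(dx_i_min, dx_j_min, lam=0.01):
    """Eq. 19 with the decay rate reported for the experiments."""
    return math.exp(-lam * (dx_i_min + dx_j_min))


# Tracklet l_i ends outside the exit area and moves along the x axis.
dx_i = min_relative_distance((0.0, 0.0), (1.0, 0.0),
                             area=(20.0, 20.0, 30.0, 30.0),
                             disappearing_point=(5.0, 3.0), t_inv=10.0)

# Tracklet l_j starts inside the enter area, so its distance is zero.
# For a start outside the area, pass the negated head velocity instead.
dx_j = min_relative_distance((12.0, 12.0), (-1.0, 0.0),
                             area=(10.0, 10.0, 15.0, 15.0),
                             disappearing_point=(11.0, 11.0), t_inv=10.0)

similarity = inter_camera_motion_similarity(dx_i, dx_j)
```

Because both distances are measured inside one camera view, the resulting value lives on the same scale as the single camera similarity, which is the property the equalization relies on.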

IV. Equalized Graph Model 

When tracking objects in a single camera, we assume
that observations are obtained under the same circumstances,
such as illumination and angle of view. Hence the targets
have a strong invariance in their appearance representations,
which can be used for tracking. During inter-camera
object tracking, this invariance is weaker due to the changes
in circumstances. When we build the graph with
nodes and edges, this makes the inter-camera similarities
much lower than the similarities in a single camera.
If we compute the appearance similarities directly
and provide no alignment or equalization for the two
similarity distributions, the optimization process will
always link the edges within a single camera preferentially
and ignore the inter-camera links as long as there is
an edge with a higher similarity in the same camera. It is hard
to get an accurate alignment of the two similarity distributions,
so the proposed approach offers an approximate alignment which
can be regarded as a compensation for the inter-camera
similarities. Our purpose is to equalize the difference between
the two similarity distributions while keeping the distribution
of the inter-camera similarity unaffected.
So our equalization is mainly applied to the distribution of
the single camera similarity, making it close to the
inter-camera similarity distribution:


$$P_a(l_i \to l_j) = \Delta\sigma\,(Dis(l_i, l_j) - \Delta\mu), \quad \Delta\mu > 0,\ s_i = s_j, \qquad (21)$$

where $\Delta\sigma$ and $\Delta\mu$ are the compensation factors, and
$Dis(l_i, l_j)$ is the original appearance similarity between
tracklets $l_i$ and $l_j$.

The factor $\Delta\mu$ is used to shift the average level of
the single camera similarity distribution and the factor $\Delta\sigma$
is adopted to control the amplitude of variation. They are
computed from the two similarity distributions:

$$\Delta\mu = \mu_1 - \mu_2, \qquad \Delta\sigma = \sigma_2 / \sigma_1, \qquad (22)$$

where $\mu_1$ and $\sigma_1$ are the mean and variance of the single
camera similarity distribution, computed over
all the single camera edges, and $\mu_2$ and $\sigma_2$ are those of the
inter-camera similarity distribution, computed over all the
inter-camera edges.
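As a rough illustration of the compensation described above, the following sketch derives the two factors from lists of single camera and inter-camera similarities and applies the shift-and-scale to single camera edges only. It is a sketch under assumptions: the population variance from Python's `statistics` module stands in for the paper's variance operator, and the function names and sample values are hypothetical.

```python
from statistics import mean, pvariance


def compensation_factors(single_sims, inter_sims):
    """Return (delta_mu, delta_sigma) from the two similarity distributions."""
    mu1, var1 = mean(single_sims), pvariance(single_sims)
    mu2, var2 = mean(inter_sims), pvariance(inter_sims)
    if var1 == 0.0:
        raise ValueError("single camera similarities have zero variance")
    return mu1 - mu2, var2 / var1


def equalize_appearance(dis, same_camera, delta_mu, delta_sigma):
    """Shift and scale single camera similarities; keep inter-camera ones."""
    if same_camera:
        return delta_sigma * (dis - delta_mu)
    return dis


# Toy similarities: single camera edges score higher and vary more.
single = [0.8, 0.9, 1.0]
inter = [0.45, 0.5, 0.55]
delta_mu, delta_sigma = compensation_factors(single, inter)

equalized_single = equalize_appearance(0.9, True, delta_mu, delta_sigma)
untouched_inter = equalize_appearance(0.5, False, delta_mu, delta_sigma)
```

With these toy numbers the mean gap is about 0.4 and the variance ratio about 0.25, so a single camera similarity of 0.9 is pulled down to roughly 0.125 while inter-camera values are left as they are.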

However, not all the edge similarities are reliable and
suitable for computing the mean and variance. Some contain a large
proportion of noise and should be excluded as outliers. In
this paper, a minimum uncertain gap (MUG) is introduced
to help filter the edges used for computing the mean and
variance. The MUG is used to measure the uncertainty of the
likelihoods between tracklets. A tracklet link with a small
MUG can be considered a more reliable link, because its
similarity is more stable and more believable. As a result, the
MUG is treated as a confidence factor for edges.

$$MUG(l_i, l_j) = \max_{n,m} Sim(H_i^n, H_j^m) - \min_{n,m} Sim(H_i^n, H_j^m), \quad n \in [1, d_i],\ m \in [1, d_j]. \qquad (23)$$
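The MUG of Eq. 23 and its use as an edge filter can be sketched as below. The sketch assumes that the pairwise similarities between the $d_i$ descriptors of one tracklet and the $d_j$ descriptors of the other have already been computed and are passed in as a nested list; the edge dictionaries, the function names and the example threshold of 0.4 (the value reported later for the experiments) are illustrative choices.

```python
def minimum_uncertain_gap(sim_matrix):
    """Eq. 23: spread between the largest and smallest pairwise similarity.

    sim_matrix[n][m] holds Sim(H_i^n, H_j^m) for one tracklet pair.
    """
    values = [s for row in sim_matrix for s in row]
    return max(values) - min(values)


def reliable_edges(edges, threshold):
    """Keep only the edges whose MUG is below the confidence threshold."""
    return [e for e in edges if minimum_uncertain_gap(e["sims"]) < threshold]


edges = [
    # Stable link: every descriptor pair agrees fairly well.
    {"pair": ("l1", "l2"), "sims": [[0.90, 0.80], [0.85, 0.95]]},
    # Unstable link: similarities swing widely between descriptor pairs.
    {"pair": ("l1", "l3"), "sims": [[0.95, 0.20], [0.40, 0.70]]},
]

kept = reliable_edges(edges, threshold=0.4)
kept_pairs = [e["pair"] for e in kept]
```

Only the first edge survives here, since its gap is about 0.15 while the second has a gap of about 0.75.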

Therefore, with the help of the MUG filtering, the mean and
variance are computed as follows:

$$\mu_1 = MEAN(Dis(l_i, l_j)), \quad \sigma_1 = VAR(Dis(l_i, l_j)), \quad MUG(l_i, l_j) < \varepsilon,\ s_i = s_j, \qquad (24)$$

$$\mu_2 = MEAN(Dis(l_i, l_j)), \quad \sigma_2 = VAR(Dis(l_i, l_j)), \quad MUG(l_i, l_j) < \varepsilon,\ s_i \neq s_j, \qquad (25)$$

where $\varepsilon$ is a confidence threshold, and MEAN() and VAR() are
the mean and variance operations respectively.

The final equalized appearance similarity metric then becomes:

$$P_a(l_i \to l_j) = \begin{cases} Dis(l_i, l_j) & \text{if } s_i \neq s_j, \\ \Delta\sigma\,(Dis(l_i, l_j) - \Delta\mu) & \text{if } s_i = s_j. \end{cases} \qquad (26)$$


V. Experiment Results

In this section, the proposed approach is evaluated in
the following aspects. First, the global graph model is
compared with the traditional two-step framework, where we
use the same feature representation for fairness. Second, a
performance comparison between the equalized graph and the
non-equalized one is provided to show the effectiveness of
the equalization process with the improved similarity metric.
Third, the proposed approach is compared with some
state-of-the-art Multi-Camera Tracking (MCT) methods. However,
as there is no benchmark for MCT, we first introduce a dataset
and a comprehensive evaluation criterion, which can
be developed into a benchmark in further work. The dataset,
called the NLPR_MCT dataset, is specialized for multi-camera
pedestrian tracking in non-overlapping cameras. The details
of the dataset are presented in Section V-A. The proposed
evaluation criterion for MCT is introduced in Section V-B.


A. Datasets

For a comprehensive performance evaluation, it is crucial
to develop a representative dataset. There are several
datasets for visual tracking in surveillance scenarios, such
as the PETS, CAVIAR, TUD and i-LIDS
databases. However, most of them are designed for multi-object
tracking in a single camera and are not suitable for
inter-camera object tracking. PETS is recorded in a simulated
environment with overlapping cameras rather than in a real scene, while
i-LIDS targets indoor multi-camera object tracking and
its ground truths are not freely available so far. For these reasons, a
new pedestrian dataset is constructed in this paper for
multi-camera object tracking to facilitate the tracking evaluation.

The NLPR_MCT dataset¹ consists of four sub-datasets.
Each sub-dataset includes 3-5 cameras with non-overlapping
scenes and presents a different situation in terms of the number of
people (ranging from 14 to 255) and the level of illumination
changes and occlusions. The collected videos contain both real
scenes and simulation environments. We also list the topological
connection matrices for pedestrian walking areas. All the
videos are nearly 20 minutes long (except Dataset 3) with a rate of
25 fps and are recorded under non-overlapping views during
the daytime, which makes the dataset a good representation of
different situations in normal life. The connection relationships
between scenes are shown in Fig. 8, where the enter/exit areas
used in this paper are also marked.


B. Evaluation Criteria 

As we know, both SCT and ICT have their own evaluation
criteria. Most SCT trackers use the multi-object tracking
accuracy (MOTA) and ID switch [65] as their evaluation
criteria, while some SCT papers prefer other terms.
In ICT, the ID switch is also a necessary term.

There are two criteria, mentioned earlier, which are
important to a multi-camera multi-object tracking system. The
SCT module and the ICT module correspond to the two criteria
respectively. As these two criteria are equally crucial for
multi-camera object tracking performance, they should be considered
equally important in the final performance measurement.

Nevertheless, in today's multi-camera object tracking, there
is hardly any widely accepted performance measurement that
takes these two criteria into account. The common criterion
used for multi-camera object tracking is an extension
of MOTA. It adds the ID switches in SCT and in
ICT together, which ignores the different incidence densities
of the ID switches in SCT and ICT. In most video scenes,
e.g., Table II, the ground truths used for frame matching in
SCT far outnumber those in ICT. This leads trackers to
care more about the trajectories in a single camera than
about the inter-camera matching. In this paper, we treat them
separately and provide a new evaluation criterion to measure
the performance of multi-camera object tracking. Our criterion
takes both the SCT and ICT criteria into account and unifies
them into one evaluation metric. The metric is called
multi-camera object tracking accuracy (MCTA):


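$$\begin{aligned} MCTA &= Detection \times Tracking^{SCT} \times Tracking^{ICT} \\ &= \left(\frac{2 \times Precision \times Recall}{Precision + Recall}\right) \left(1 - \frac{\sum_t mme_t^s}{\sum_t tp_t^s}\right) \left(1 - \frac{\sum_t mme_t^c}{\sum_t tp_t^c}\right). \end{aligned} \qquad (27)$$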

It is also derived from MOTA [65] and can be applied
to multi-camera object tracking. It avoids the disadvantage
of MOTA that the score can become negative due to false positives.
The MCTA ranges from 0 to 1. The metric contains three
parts: detection ability, SCT ability and ICT ability, which
correspond to the three brackets in Eq. 27. The Precision
and Recall are integrated by the F1-score to measure the detection
power and the occlusion handling ability. In this paper, the
experiments focus on testing the SCT and the ICT abilities of
the proposed approach, so for the first two experiments, we use
the ground truths of object detections as the inputs instead
of running a real detector, which leads to Precision = 1 and
Recall = 1. In the last experiment, a DPM [59] detector is
used to get the detection results.

¹ http://mct.idealtest.org/Datasets.html

[Figure 8; surviving panel title: Dataset 4: Outdoor Scene]
Fig. 8. Illustration of the topological relationship during tracking. The topological relationships for every dataset are shown in the right column, and
the blue polygons stand for enter/exit areas used in our experiments for Dataset 1-4.

TABLE II
The single-camera and inter-camera ground truths for all four sub-datasets.

Dataset      tp^s (Tracking^SCT)   tp^c (Tracking^ICT)
Dataset 1    71853                 334
Dataset 2    88419                 408
Dataset 3    18187                 152
Dataset 4    42615                 256


$$Detection = \frac{2 \times Precision \times Recall}{Precision + Recall}, \quad Precision = 1 - \frac{\sum_t fp_t}{\sum_t r_t}, \quad Recall = 1 - \frac{\sum_t m_t}{\sum_t g_t}, \qquad (28)$$

where $fp_t$, $r_t$, $m_t$ and $g_t$ are the numbers of false positives,
hypotheses, misses and ground truths respectively at time $t$.

For the SCT and ICT ability parts, we measure the abilities via
the number of mismatches (ID switches). We split the number
of mismatches $mme_t$ in MOTA [65] into $mme_t^s$ and $mme_t^c$.
$mme_t^s$ is the number of mismatches that happen within a
single camera and $mme_t^c$ is the number of inter-camera mismatches.

$$Tracking^{SCT} = 1 - \frac{\sum_t mme_t^s}{\sum_t tp_t^s}, \qquad Tracking^{ICT} = 1 - \frac{\sum_t mme_t^c}{\sum_t tp_t^c}. \qquad (29)$$

$tp_t^s$ and $tp_t^c$ are the numbers of matched frames in the ground
truths. $tp_t^s$ counts the matchings whose two frames
are from the same camera, and $tp_t^c$ counts the
inter-camera matchings. It is worth noting that both $tp_t^s$ and
$tp_t^c$ are counted among the true positive detection results. A new
target is counted as an inter-camera ground truth by default
in our criterion.
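The three factors of the MCTA can be checked numerically with a few lines of code. The sketch below is an illustration with hypothetical function names; it takes counts already summed over all frames, and it reuses the Dataset 1 ground truth counts from Table II (71853 single camera and 334 inter-camera matchings) together with the EqlA+M and NonA mismatch counts from Table III, where Precision and Recall are both 1 because detection ground truths are used as input.

```python
def precision_recall(false_positives, hypotheses, misses, ground_truths):
    """Eq. 28: detection precision and recall from per-frame counts."""
    precision = 1.0 - sum(false_positives) / sum(hypotheses)
    recall = 1.0 - sum(misses) / sum(ground_truths)
    return precision, recall


def mcta(precision, recall, mme_s, tp_s, mme_c, tp_c):
    """Eq. 27: F1 detection term times the SCT and ICT tracking terms.

    mme_s / mme_c are the summed single camera / inter-camera mismatches,
    tp_s / tp_c the summed single camera / inter-camera matchings.
    """
    detection = 2.0 * precision * recall / (precision + recall)
    tracking_sct = 1.0 - mme_s / tp_s
    tracking_ict = 1.0 - mme_c / tp_c
    return detection * tracking_sct * tracking_ict


# Dataset 1, equalized appearance plus motion (EqlA+M).
eqla_m = mcta(1.0, 1.0, mme_s=66, tp_s=71853, mme_c=49, tp_c=334)

# Dataset 1, non-equalized appearance (NonA).
non_a = mcta(1.0, 1.0, mme_s=71, tp_s=71853, mme_c=123, tp_c=334)

# Toy per-frame detection counts for the precision/recall helper.
p, r = precision_recall([1, 0], [10, 10], [2, 2], [10, 10])
```

Rounded to four decimals, `eqla_m` and `non_a` give 0.8525 and 0.6311, matching the Dataset 1 entries of Table III. Because the inter-camera term divides by only a few hundred matchings, a handful of inter-camera mismatches moves the score far more than the same number of single camera mismatches, which is the balancing effect the criterion is designed for.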


C. Global Graph Model vs. Two-Step Framework

The advantage of the proposed method is that it improves the
ICT performance under an imperfect SCT result. So in this
section, the proposed global graph model is compared with
the traditional two-step framework, i.e., an SCT approach plus
an ICT approach. We use the same MAP model to solve the
data association in both the SCT and ICT steps of the two-step
framework in order to remove the interference of different
data association methods. Adopting the MAP model in SCT
is presented by Zhang et al. [21]. Using the MAP model
in ICT is not a suitable solution when the tracking results
in a single camera are perfect and unchangeable. But as
discussed earlier, when the SCT results are not ideal,
the data association in ICT is more like a global
optimization problem, which can be solved by the MAP model,
than a K-partite graph matching problem. That is
another reason why we use the MAP model for the data
association in ICT in the traditional two-step framework. As a
complement, we also use the Hungarian algorithm [47] for
the ICT step, which is a classical data association method
for ICT. For fairness, the feature representation in this
experiment is the PMCSHR appearance and motion features
for all baselines.

In this experiment, the waiting time threshold $\eta$ and the
threshold $\varepsilon$ of the MUG are set to 60*25*1 and 0.4
respectively, and the weights of the two features $k_1$ and $k_2$ are both
1. To show the ability of the proposed approach to handle
imperfect tracklets from SCT, the experiment changes the confidence
threshold $\theta$ of the AIF tracker to produce more
fragments artificially. The threshold $\theta$ ranges from 0 to 0.2
and the corresponding numbers of tracklets are listed beside
the threshold in Fig. 9.

[Figure 9; panels: SCT Mismatch of Dataset1, Dataset2, Dataset3 and Dataset4]
Fig. 9. Performance evaluation of the proposed approach under different parameter settings. The x-coordinate for all the figures is the confidence
threshold $\theta$ of the AIF tracker, and the number in brackets is the corresponding number of tracklets. With the increase of $\theta$, the tracklet number grows and more
tracklet fragments are produced. The y-coordinates in the three rows are the SCT mismatch number, the ICT mismatch number and the MCTA score respectively.
The performance score under $\theta = 0$ is shown in the legend. The global graph method is the proposed approach. The two-step method with MAP is Zhang's work
[21], which uses MAP for the SCT process. The two-step methods with MAP and with Hungarian in the last two rows stand for the approaches that solve the ICT
problem with MAP and with the Hungarian algorithm [47].

TABLE III
Empirical comparison of the proposed approach on four multi-camera tracking datasets. The bold indicates the best performance.

                     NonA     EqlA     M        EqlA+M
Dataset 1   mme^s    71       76       53       66
            mme^c    123      88       101      49
            MCTA     0.6311   0.7357   0.6971   0.8525
Dataset 2   mme^s    83       109      67       93
            mme^c    201      164      126      107
            MCTA     0.5069   0.5973   0.6907   0.7370
Dataset 3   mme^s    59       71       74       51
            mme^c    132      116      95       80
            MCTA     0.1312   0.2359   0.3735   0.4724
Dataset 4   mme^s    125      137      123      128
            mme^c    187      169      188      159
            MCTA     0.2687   0.3388   0.2649   0.3778
Average MCTA         0.3845   0.4769   0.5066   0.6099

The total single-camera matching number $tp^s$ and
inter-camera matching number $tp^c$ of the ground truths for each
sub-dataset are listed in Table II. From the first two rows in Fig. 9,
we can see that with the increase of the number of fragmented
tracklets, both the single camera mismatch number $mme^s$ and
the inter-camera mismatch number $mme^c$ grow significantly
in the proposed global graph and in the two-step framework. In
the first row, the single camera mismatch number $mme^s$ in the
proposed global graph is generally larger than that in the
two-step framework [21], because the two-step framework offers
an optimization in each camera, which gives it a better
local result. In Dataset 3 and Dataset 4, however, the $mme^s$ in the
proposed global graph becomes lower than that in the
two-step framework. The reason is that these two datasets
are recorded under simulated conditions with frequent
“walking around” behaviors. In this case, the inter-camera
information may provide more useful feedback for each
specific camera and can partly improve the SCT performance.
For the inter-camera mismatch number $mme^c$ in the middle
row, the number in the proposed global graph is much lower
than that of both the MAP and the Hungarian algorithm [47] in the
two-step framework, which indicates the effectiveness of our global
graph model in improving the ICT performance. In Dataset 4, it can be
seen that the $mme^c$ in the proposed graph is not smaller than
that in the two-step framework at first. However, with the
increase of fragmented tracklets, the $mme^c$ in the proposed
graph increases much more slowly and finally becomes smaller
than that in the two-step framework. What is more, for the ICT
step in the two-step framework, the data association method based
on the global MAP is always better than that with the Hungarian
algorithm [47]. This partly supports the assumption that, because
of non-ideal SCT results, the data association in ICT is better
treated as a global optimization problem than as a K-partite
graph matching problem. In the last row,
the MCTA of the global MAP graph always keeps the highest score,
which implies that the proposed global graph model offers
a better performance than the traditional two-step
framework.

TABLE IV
Performance comparison using the ground truths of single camera object tracking as input.

                     Ours     USC-Vision [32]   Hfutdspmct [54]   CRIPAC-MCT
Dataset 1   mme^c    55       27                86                113
            MCTA     0.8353   0.9152            0.7425            0.6617
Dataset 2   mme^c    121      34                141               167
            MCTA     0.7034   0.9132            0.6544            0.5907
Dataset 3   mme^c    39       70                40                44
            MCTA     0.7417   0.5163            0.7368            0.7105
Dataset 4   mme^c    157      72                155               110
            MCTA     0.3845   0.7052            0.3945            0.5703
Average MCTA         0.6662   0.7625            0.6321            0.6333

TABLE V
Performance comparison using the ground truths of object detection as input.

                     Ours     USC-Vision [32]   Hfutdspmct [54]   CRIPAC-MCT
Dataset 1   mme^s    66       63                77                135
            mme^c    49       35                84                103
            MCTA     0.8525   0.8831            0.7477            0.6903
Dataset 2   mme^s    93       61                109               230
            mme^c    107      59                140               153
            MCTA     0.7370   0.8397            0.6561            0.6234
Dataset 3   mme^s    51       93                105               147
            mme^c    80       111               121               139
            MCTA     0.4724   0.2427            0.2028            0.0848
Dataset 4   mme^s    128      70                97                140
            mme^c    159      141               188               209
            MCTA     0.3778   0.4357            0.2650            0.1830
Average MCTA         0.6099   0.6003            0.4679            0.3954


D. Equalized vs. Non-equalized Graph Model

This experiment is conducted to verify the effectiveness
of the similarity equalization process. All the trackers are
under our global graph model. We compare the equalized
appearance similarity metric with the non-equalized one and
then combine it with our equalized motion metric. In
this experiment, the confidence threshold $\theta$ of the AIF
tracker is fixed to 0.

The results are shown in Table III. NonA and EqlA are the
results with the non-equalized and the equalized appearance
features. M corresponds to the results with the equalized
motion feature only, and EqlA+M combines
the equalized appearance feature and the motion feature.
It can be seen that the result with the non-equalized
appearance similarity has a lower single camera mismatch number
$mme^s$ than the equalized one. This
means that when we conduct equalization, the single camera
performance drops due to the change of the distribution
of the single camera similarity, which is unavoidable but
acceptable. In the inter-camera tracking, it is clear that the
equalized appearance similarity greatly helps
to reduce the number $mme^c$ of mismatches across cameras.
When the equalized motion information is added, the $mme^c$
decreases further. The MCTA is the final comprehensive score,
which takes both the SCT and ICT performances into account.
The larger the score is, the better the tracker performs.
As seen in Table III, the equalized appearance similarity
combined with the equalized motion information has
the highest score. This indicates that the increased single camera
mismatch number $mme^s$ in our method is an acceptable price for
reducing the inter-camera mismatch number $mme^c$ and getting
a higher score in the overall MCT performance. Furthermore,
when we use the motion feature alone for multi-camera
object tracking, the performance is comparable with and sometimes
better than that of the appearance feature, which partly supports the
effectiveness of our equalized motion similarity metric.

E. Equalized Global Graph Model vs. State of the Art

In this section, we compare our equalized global MAP
graph model with other multi-camera object tracking methods.
To be comparable, the methods must be able to
handle both the SCT and the ICT steps. We compare the
proposed graph with current two-step multi-camera object
tracking methods. The methods are from the Multi-Camera
Object Tracking (MCT) Challenge. USC-Vision [32]
is the winner of the challenge and is considered
the state-of-the-art two-step multi-camera object tracking
approach. We first conduct the comparison under the condition
that the ground truths of single camera object tracking are
available; the results are shown in Table IV. It reflects the
ICT power of each method when the single camera object
tracking results are perfect. From the average MCTA score
we can see that USC-Vision [32] is much better than
our proposed method, which shows the advantage of
USC-Vision's ICT method. In Table V, only the ground truths of
object detections are available, so the trackers have to perform the
single camera object tracking by themselves. In this case,
their single camera object tracking results cannot be as
perfect as the ground truths, and their inter-camera object
tracking algorithms have to cope with the resulting fragments and false
positives. From Table V, although the SCT performance $mme^s$
of USC-Vision is better than ours on most datasets, it is clear
that the number of its ICT mismatches increases much more
sharply than that of our method, which indicates that its powerful
ICT method loses its advantage under imperfect SCT
results. These results are also illustrated in the accompanying
figure. In the final evaluation,
our equalized global graph model has the highest average
MCTA score (0.6099 against 0.6003 for USC-Vision), which further
supports the advantage of our
proposed model in improving the ICT performance under an
imperfect SCT result. Finally, as perfect detection can never
be achieved in reality, we conduct another experiment without the
detection ground truths. We use the DPM detector [59]
to get the detection results. In Table VI, $Tracking^{SCT}$
and $Tracking^{ICT}$ corresponding to Eq. 29 are listed instead
of $mme$ because different detection results lead to
different $tp$ values. The results in Table VI show that our
result is not the best but is comparable with the state of the
art. Under a real detector, there are many missing and
false positive detections. The ability of a multi-camera tracker
to handle these missing or false positive detections mainly
comes from its SCT part. USC-Vision uses a hierarchical
association to build its tracklets, in which the detections
are selected carefully and some missing detections can be
partly recovered. In our method, a real-time single object
tracker [60] is adopted to get the tracklets, which can partly
handle missing detections. For false detections, however, once
the tracker drifts to a false detection, the whole
tracklet becomes unreliable. Owing to the hierarchical
association in its SCT step, USC-Vision has a more reliable
set of tracklets for the following ICT step than we have.
Even with the help of the proposed equalized global graph, our
final result is still a little lower than USC-Vision's. This does not
deny the effectiveness of our equalized global graph model, but
shows the advantage of USC-Vision's SCT method in handling
misses and false positives. However, for practical use in real
environments, detection-level association is much slower
than a real-time single camera tracker. That is why we use the
AIF tracker to get tracklets in our method instead of using
USC-Vision's detection-based hierarchical association. Some
other single object trackers, such as TLD, may handle
false-detection drifts through their online learning mechanisms,
but they cost too much time and memory for learning the
online models, which makes them hard to apply to forming our
raw tracklets. As a result, a real-time single camera tracker
that can deal with false detections is a promising direction for
future work on multi-camera object tracking.

TABLE VI
Performance comparison without the ground truths of object detection. The final MCTA is shown as bold for clarity.

                            Ours     USC-Vision [32]   Hfutdspmct [54]   CRIPAC-MCT
Dataset 1   precision       0.7967   0.6916            0.7113            0.1488
            recall          0.5929   0.6061            0.3465            0.2154
            Tracking^SCT    0.9744   0.9981            0.9229            0.9955
            Tracking^ICT    0.6220   0.9288            0.6534            0.7111
            MCTA            0.4120   0.5989            0.2810            0.1246
Dataset 2   precision       0.7977   0.6948            0.7461            0.1431
            recall          0.6332   0.7843            0.3669            0.1933
            Tracking^SCT    0.9779   0.9986            0.9347            0.9945
            Tracking^ICT    0.6942   0.8507            0.6122            0.7510
            MCTA            0.4793   0.6260            0.2815            0.1075
Dataset 3   precision       0.8207   0.4750            0.3342            0.0853
            recall          0.5345   0.6615            0.0986            0.1206
            Tracking^SCT    0.9749   0.9904            0.9682            0.9715
            Tracking^ICT    0.2953   0.1014            0.2432            0.1143
            MCTA            0.1864   0.0555            0.0359            0.0111
Dataset 4   precision       0.8355   0.5216            0.7720            0.0606
            recall          0.6193   0.79375           0.1210            0.0944
            Tracking^SCT    0.9275   0.9948            0.9865            0.9762
            Tracking^ICT    0.4308   0.5437            0.2944            0.2950
            MCTA            0.2842   0.3404            0.0608            0.0213
Average MCTA                0.3405   0.4052            0.1648            0.0661

VI. Conclusion 

In order to address the problem of multi-camera
non-overlapping visual object tracking, we develop a joint
approach that optimises the single camera object tracking and
the inter-camera object tracking in one graph. This joint
approach overcomes the disadvantages of the traditional
two-step tracking approaches. In addition, the similarity metrics of
both appearance and motion features in the proposed global
graph are equalized. The equalization further reduces the
number of mismatch errors in inter-camera object tracking.
The results show its effectiveness for multi-camera object
tracking, especially when the SCT performance is not perfect.
Our approach focuses on the graph modeling instead of the
feature representation learning. Any existing re-identification
feature representation method can be incorporated into our
framework.